Abstract

We present a new framework for the assessment and calibration of medium-range ensemble temperature
forecasts. The method is based on maximising the likelihood of a simple parametric model
for the temperature distribution, and leads to some new insights into the predictability of uncertainty.



1 Introduction 

A number of different methods have been used for the assessment and calibration of ensemble forecasts
(for example see Atger (1999), Richardson (2000), Roulston and Smith (2002) and Wilson et al. (1999)).
In many applications of ensemble forecasts the forecast is used to derive the probability of a certain
outcome, such as temperature dropping below zero. In this context, the reliability diagram is an
appropriate method for assessing reliability (see Anderson (1996), Eckel and Walters (1998), Talagrand et al.
(1997) and Hamill (1997)) and the relative operating characteristic (ROC) (Mason (1982), Swets (1988),
Mason and Graham (1999)) is an appropriate method to evaluate the resolution.

In other applications of ensemble forecasts, however, the forecast is interpreted as providing a mean 
and a distribution of future values of temperature. For example in the field of weather derivatives the 
calculation of the fair strike for a certain class of weather swap contract needs an estimate of the mean of
the future temperatures, while the calculation of the fair premium for weather option contracts needs an
estimate of the whole distribution of future temperatures (see Jewson and Caballero (2002) for details).
Additionally, the assumption is often made that temperature is normally distributed since this allows the 
temperature forecast to be summarised succinctly using just the mean and the standard deviation. For 
such mean-and-distribution or mean-and-standard deviation based applications of ensemble forecasts the 
reliability diagram and the ROC are not particularly appropriate. 

In this paper we present a new framework for the assessment and calibration of ensemble temperature 
forecasts based on analysis of the mean and standard deviation of the distribution of temperatures. The 
method has been developed to respond to the need for a simple and practical method for assessment and 
calibration that can be used by companies that make use of ensemble forecasts in the weather derivative 
market. We postulate a parametric model for the mean and standard deviation and fit the parameters 
of the model using the maximum likelihood method. This approach has a number of advantages relative 
to the assessment and calibration methods mentioned above. The model is simple, easy to interpret, 
and the entire ensemble distribution can be calibrated in one simple step. Also the model gives a clear 
indication of how many days of useful information there are in a forecast. 

In section 2 we describe the data sets we use for this study. In section 3 we describe the statistical model 
that forms the basis for the method we propose. In section 4 we describe the results from fitting the 
model. In section 5 we discuss extensions to other distributions and in section 6 we summarise our results 
and draw some conclusions. 



2 Data 

We will base our analyses on one year of ensemble forecast data for the weather station at London's 
Heathrow airport, WMO number 03772. The forecasts are predictions of the daily average temperature, 
and the target days of the forecasts run from 1st January 2002 to 31st December 2002. The forecast was
produced from the ECMWF model (Molteni et al., 1996) and downscaled to the airport location using
a simple interpolation routine prior to our analysis. There are 51 members in the ensemble. We will 
compare these forecasts to the quality controlled climate values of daily average temperature for the same 
location as reported by the UKMO. 

Throughout this paper all equations and all values have had both the seasonal mean and the seasonal 
standard deviation removed. Removing the seasonal standard deviation removes most of the seasonality 
in the forecast error statistics, and justifies the use of non-seasonal parameters in the statistical models 
for temperature that we propose. 
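As a rough illustration of the standardisation described above, the following Python sketch removes a seasonal mean and a seasonal standard deviation from a daily series. The paper does not say how the seasonal cycle was estimated, so the day-of-year lookup arrays `clim_mean` and `clim_sd`, the function name `standardise` and the toy data are all assumptions made purely for illustration.

```python
import numpy as np

def standardise(values, day_of_year, clim_mean, clim_sd):
    """Return anomalies with the seasonal mean and seasonal standard deviation removed.

    values      : daily temperatures (observations, or one forecast series)
    day_of_year : 0-based day index for each value
    clim_mean   : hypothetical seasonal mean for each day of the year
    clim_sd     : hypothetical seasonal standard deviation for each day of the year
    """
    values = np.asarray(values, dtype=float)
    idx = np.asarray(day_of_year, dtype=int)
    return (values - np.asarray(clim_mean)[idx]) / np.asarray(clim_sd)[idx]

# Toy example (not the Heathrow data): a smooth seasonal cycle in mean and variance.
days = np.arange(365)
clim_mean = 11.0 - 7.0 * np.cos(2 * np.pi * days / 365)
clim_sd = 3.0 + 1.0 * np.cos(2 * np.pi * days / 365)
rng = np.random.default_rng(0)
temps = clim_mean + clim_sd * rng.standard_normal(365)
anoms = standardise(temps, days, clim_mean, clim_sd)
```

After this step the anomalies have approximately zero mean and unit standard deviation in every season, which is what makes non-seasonal model parameters plausible.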

3 The Moment-based Ensemble Assessment and Calibration Model 



For forecasts of temperature anomalies, it has long been recognized (see for example Leith (1974)) that
the use of a final regression step between ensemble mean and observations can eliminate bias and minimise
the mean square error (MSE). For our purposes we will write this regression step as:

$$T_i \sim N(\alpha + \beta m_i, \sigma) \qquad (1)$$

where $T_i$ is the observed temperature on day $i$, $N(\mu, \sigma)$ represents a normal distribution with mean $\mu$
and standard deviation $\sigma$, $m_i$ is the forecast of the temperature (in our case, the ensemble mean) and $\alpha$,
$\beta$ and $\sigma$ are free parameters. This regression model postulates that temperatures come from a normal
distribution with mean given by $\mu_i = \alpha + \beta m_i$ and standard deviation given by $\sigma$. The values for $\alpha$, $\beta$
and $\sigma$ come from fitting the model, and this is usually done using least squares linear regression. One
justification for the use of least squares linear regression is that for this particular model it is equivalent to
finding the parameters that maximize the likelihood of the data given the model (see Press et al. (1992)),
as long as we assume that the forecast errors are uncorrelated in time. We note that although the model
in equation 1 postulates that the data come from a normal distribution, it can be applied in situations
in which the data are not strictly normal, and in fact it is common (although perhaps bad) practice not
to test for normality when doing such linear regressions.
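The equivalence between least squares and maximum likelihood for equation 1 can be checked numerically. The sketch below is only an illustration on synthetic data: the function names `fit_equation_1` and `log_likelihood_1` and the simulated values are assumptions, and the standard deviation is estimated with the 1/n convention that maximum likelihood implies.

```python
import numpy as np

def fit_equation_1(m, T):
    """Fit T_i ~ N(alpha + beta * m_i, sigma) by least squares.

    Under the normal model with errors uncorrelated in time these are also
    the maximum likelihood estimates (sigma uses the 1/n convention).
    """
    m = np.asarray(m, dtype=float)
    T = np.asarray(T, dtype=float)
    X = np.column_stack([np.ones_like(m), m])
    (alpha, beta), *_ = np.linalg.lstsq(X, T, rcond=None)
    resid = T - (alpha + beta * m)
    sigma = np.sqrt(np.mean(resid ** 2))
    return alpha, beta, sigma

def log_likelihood_1(params, m, T):
    """Log-likelihood of equation 1 for parameters (alpha, beta, sigma)."""
    alpha, beta, sigma = params
    z = (np.asarray(T) - alpha - beta * np.asarray(m)) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * z ** 2)

# Synthetic anomalies (not the data used in the paper).
rng = np.random.default_rng(1)
m = rng.standard_normal(365)
T = 0.1 + 0.8 * m + 0.5 * rng.standard_normal(365)
alpha, beta, sigma = fit_equation_1(m, T)
best = log_likelihood_1((alpha, beta, sigma), m, T)
```

Perturbing any one of the three fitted values lowers `log_likelihood_1`, which is the sense in which the least squares fit is also the maximum likelihood fit.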

One of the assumptions in this model is that the standard deviation of the forecast errors $\sigma$ is constant.
However it is well documented that the size of forecast errors varies in time (Palmer and Tibaldi, 1988)
and that there is a relationship between the ensemble spread and the size of forecast errors (Toth et al.,
2000). It thus makes sense to attempt to generalize the model in equation 1 to a model that takes these
temporal variations in $\sigma$ into account. We will do this using the model:

$$T_i \sim N(\alpha + \beta m_i, \gamma + \delta s_i) \qquad (2)$$

where the free parameter $\sigma$ has been replaced by a linear function of the ensemble spread $s_i$, and two new
parameters $\gamma$ and $\delta$ have been introduced. Modelling the standard deviation as a linear function of the
ensemble spread in this way allows for both time variation and the correction of biases in the predicted
uncertainty.

The optimum parameters for this model can no longer be fitted using least squares linear regression. 
However, they can be fitted if we can identify a cost function that can be minimised or maximised by 
varying the parameters. There are various possibilities for such a cost function, but one of the most natural 
is the likelihood, defined as the probability density of the observations given the calibrated forecast. 
Maximising the likelihood is the standard way to fit parameters in statistics (see for instance textbooks
such as Casella and Berger (2002) or Lehmann and Casella (1998)), and gives the most accurate possible
estimates of the parameters for most statistical models.
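A minimal sketch of such a likelihood fit for equation 2 is given below. It is not the fitting code used for the results in this paper: the optimiser is a deliberately simple refined grid search over $\gamma$ and $\delta$ (restricted here to non-negative values, which is an assumption), with $\alpha$ and $\beta$ obtained by weighted least squares for each trial pair. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def neg_log_likelihood(params, m, s, T):
    """Negative log-likelihood of equation 2 for (alpha, beta, gamma, delta)."""
    alpha, beta, gamma, delta = params
    sigma = gamma + delta * s
    if np.any(sigma <= 0):
        return np.inf  # standard deviations must be positive
    z = (T - alpha - beta * m) / sigma
    return np.sum(0.5 * np.log(2 * np.pi) + np.log(sigma) + 0.5 * z ** 2)

def best_mean_parameters(m, T, sigma):
    """For fixed sigma_i, alpha and beta solve a weighted least squares problem."""
    w = 1.0 / sigma
    X = np.column_stack([np.ones_like(m), m]) * w[:, None]
    coef, *_ = np.linalg.lstsq(X, T * w, rcond=None)
    return coef

def fit_equation_2(m, s, T, n_grid=21, n_refine=5):
    """Crude maximum likelihood fit: refined grid search over (gamma, delta)."""
    m, s, T = (np.asarray(a, dtype=float) for a in (m, s, T))
    g_lo, g_hi = 0.0, 2.0 * np.std(T)
    d_lo, d_hi = 0.0, 2.0 * np.std(T) / np.mean(s)
    best = (np.inf, None)
    for _ in range(n_refine):
        for gamma in np.linspace(g_lo, g_hi, n_grid):
            for delta in np.linspace(d_lo, d_hi, n_grid):
                sigma = gamma + delta * s
                if np.any(sigma <= 0):
                    continue
                alpha, beta = best_mean_parameters(m, T, sigma)
                nll = neg_log_likelihood((alpha, beta, gamma, delta), m, s, T)
                if nll < best[0]:
                    best = (nll, (alpha, beta, gamma, delta))
        # Halve the search box and centre it on the best point found so far.
        _, (_, _, g, d) = best
        g_half, d_half = (g_hi - g_lo) / 4, (d_hi - d_lo) / 4
        g_lo, g_hi = max(0.0, g - g_half), g + g_half
        d_lo, d_hi = max(0.0, d - d_half), d + d_half
    return best[1], -best[0]

# Synthetic data generated from equation 2 with known parameters.
rng = np.random.default_rng(2)
n = 1000
m = rng.standard_normal(n)
s = rng.uniform(0.3, 1.5, n)
T = 0.1 + 0.8 * m + (0.3 + 0.5 * s) * rng.standard_normal(n)
(a_hat, b_hat, g_hat, d_hat), loglik = fit_equation_2(m, s, T)
```

In practice a general-purpose numerical optimiser would normally replace the grid search; the grid is used here only to keep the sketch free of dependencies beyond NumPy.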

As with the linear regression model, this model is also not restricted to cases in which temperature is 
exactly normally distributed: the assumption of the normal distribution merely provides a metric in 
which the likelihood can be calculated and the parameters fitted. This metric is most appropriate when 
the data is at least close to normally distributed. For cases when the data is not close to normal other 
distributions can be used, or the data can be transformed to normal.

There are a number of useful features of the model we present. These include:

• Once the parameters have been fitted to past historical data, calibration of future ensemble forecasts 
is easy since it just involves applying linear transformations to the ensemble mean and standard 
deviation. The calibrated values for the mean and the standard deviation can be used to define the 
whole forecast distribution, or can be used to shift and stretch the individual ensemble members,
if individual ensemble members need to be preserved. In the latter case non-normality in the 
distribution of the original ensemble members will not be destroyed. 

• The optimum values of the parameters in equation 2 have a clear interpretation and give us useful
information about the performance of the ensemble. For instance $\alpha$ identifies a bias in the mean, and
$\beta$ represents a scaling of the forecast towards climatological values. In a perfect forecast, $\alpha$ would
be zero and $\beta$ would be one. The spread parameters $\gamma$ and $\delta$ combine to optimize the prediction
of uncertainty about the mean. The value of the ensemble spread $s$ varies in time because of the
dependence of the growth rate of differences between ensemble members on the actual model state.
The calibrated standard deviation value $\sigma_i = \gamma + \delta s_i$ additionally includes uncertainty due to model
error. If the spread of the ensemble contains very little real information, $\delta$ will tend to be small,
and $\gamma$ will tend to be large to compensate.

• It is very easy to calculate approximate uncertainty levels on the values of the parameters as part 
of the fitting procedure. This is done using the curvature of the log-likelihood at the maximum 
(see the above references on likelihood methods). These uncertainty levels give us a clear answer to
the question of whether the ensemble forecast has useful skill at different lead times. For instance,
once $\beta$ is not significantly different from zero we can say that the ensemble mean no longer contains
useful information (at least not within this framework) and once $\delta$ is not significantly different from
zero then we can say that the ensemble spread no longer contains useful information. This raises
the interesting possibility that we might identify situations in which the mean may contain more 
days of useful information than the spread. 

• It is often necessary to decide which of two forecasts is the more accurate. If two forecasts are both 
calibrated using equation 2 then the log-likelihood provides a natural way to compare the forecasts.
Log-likelihood measures the ability of the forecast to represent the whole distribution of observed 
temperatures, and is a generalisation of mean square error. It can be presented in a number of ways 
such as log-likelihood or log-likelihood skill score. 
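The shift-and-stretch calibration described in the first bullet above can be sketched as follows. This is an illustration under stated assumptions: the function name `calibrate_ensemble` is hypothetical, the ensemble spread is taken to be the sample standard deviation across members (`ddof=1`), and the parameter values in the example are invented rather than fitted.

```python
import numpy as np

def calibrate_ensemble(members, alpha, beta, gamma, delta):
    """Shift and stretch ensemble members so that their mean and standard
    deviation match the calibrated values implied by equation 2.

    members : array of shape (n_days, n_members) of forecast anomalies
    """
    members = np.asarray(members, dtype=float)
    m = members.mean(axis=1, keepdims=True)          # ensemble mean m_i
    s = members.std(axis=1, ddof=1, keepdims=True)   # ensemble spread s_i (assumed ddof=1)
    mu = alpha + beta * m                            # calibrated mean
    sigma = gamma + delta * s                        # calibrated standard deviation
    # Standardise each member, then rescale; the shape of the ensemble is preserved.
    return mu + sigma * (members - m) / s

# Example with random "forecasts" and invented parameter values.
rng = np.random.default_rng(4)
raw = rng.standard_normal((3, 51))
cal = calibrate_ensemble(raw, alpha=0.1, beta=0.9, gamma=0.3, delta=0.8)
```

Because every member is transformed by the same linear map on a given day, any skewness or other non-normality in the raw ensemble carries over to the calibrated ensemble.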

Forecasts calibrated using equation 2 will not necessarily minimise MSE. Users interested purely in a
single forecast that minimises MSE should thus calibrate using equation 1. However, users interested
in predictions of uncertainty, or, equivalently, in the whole distribution of possible temperatures, should
calibrate using equation 2. In practice we have found that the mean temperature prediction produced by
equation 2 is close to that produced by equation 1, presumably because the fluctuations in uncertainty
are not large.
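One way to see why the two calibrations of the mean differ is to write out the log-likelihood of equation 2 and its stationarity conditions for $\alpha$ and $\beta$. The derivation below is a sketch that follows directly from the normal density, assuming forecast errors uncorrelated in time; the symbols $\ell$, $n$ (the number of days) and $\sigma_i$ are shorthand introduced here.

```latex
\ell(\alpha,\beta,\gamma,\delta)
  = -\frac{n}{2}\log(2\pi) - \sum_{i=1}^{n}\log\sigma_i
    - \sum_{i=1}^{n}\frac{(T_i-\alpha-\beta m_i)^2}{2\sigma_i^2},
\qquad \sigma_i = \gamma + \delta s_i

\frac{\partial\ell}{\partial\alpha}
  = \sum_{i=1}^{n}\frac{T_i-\alpha-\beta m_i}{\sigma_i^2} = 0,
\qquad
\frac{\partial\ell}{\partial\beta}
  = \sum_{i=1}^{n}\frac{m_i\,(T_i-\alpha-\beta m_i)}{\sigma_i^2} = 0
```

These are the normal equations of a weighted least squares fit with weights $1/\sigma_i^2$, so days with a large calibrated spread count for less. When $\delta = 0$ all weights are equal and the conditions reduce to those of ordinary least squares, which is consistent with the two mean predictions being close when the fluctuations in $\sigma_i$ are small.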

4 Results 

The optimum values for the parameters in equation 2 for our 1 year of forecast data and observations are
shown in figure 1. In each case we show the approximate 95% sampling error confidence intervals around
the optimum parameters. In some cases they are so narrow that they are hard to see in the graphs.
Looking at $\alpha$ we see that there is a small and roughly constant bias in the temperatures produced by
the ensemble. Correction of the ensemble mean (or each ensemble member) using $\alpha$ would eliminate this
bias, as long as the ensemble stays stationary.
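Approximate confidence intervals of this kind can be obtained from the curvature of the log-likelihood at its maximum. The sketch below illustrates the idea on the simpler model of equation 1, where the maximum likelihood estimates are available in closed form; the same recipe applies to equation 2 once its estimates have been found. The finite-difference Hessian, the function names and the synthetic data are assumptions for illustration, not the procedure used to produce figure 1.

```python
import numpy as np

def log_likelihood(params, m, T):
    """Log-likelihood of equation 1 for parameters (alpha, beta, sigma)."""
    alpha, beta, sigma = params
    z = (T - alpha - beta * m) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * z ** 2)

def numerical_hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

def confidence_intervals(f, mle, z=1.96):
    """Approximate 95% intervals from the curvature of the log-likelihood."""
    cov = np.linalg.inv(-numerical_hessian(f, mle))  # inverse observed information
    se = np.sqrt(np.diag(cov))
    return np.column_stack([mle - z * se, mle + z * se]), se

# Synthetic anomalies (not the data used in the paper).
rng = np.random.default_rng(3)
m = rng.standard_normal(365)
T = 0.1 + 0.8 * m + 0.5 * rng.standard_normal(365)
X = np.column_stack([np.ones_like(m), m])
(alpha, beta), *_ = np.linalg.lstsq(X, T, rcond=None)
sigma = np.sqrt(np.mean((T - alpha - beta * m) ** 2))
mle = np.array([alpha, beta, sigma])
intervals, se = confidence_intervals(lambda p: log_likelihood(p, m, T), mle)

# The ensemble mean is judged useful if the interval for beta excludes zero.
beta_is_useful = not (intervals[1, 0] <= 0.0 <= intervals[1, 1])
```

For equation 1 the resulting standard error of the slope agrees with the textbook regression formula, which gives a simple check on the numerical Hessian.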

The parameter $\beta$ is slightly less than 1 at all leads. This shows that the ensemble mean varies too much:
either the ensemble mean, or each ensemble member, should be reduced by the factor $\beta$ towards the
climatology. Such a damping factor is presumably required because the ensemble members are more
correlated with each other than they are with the observations and because the ensemble is finite in size.
Even at lead 10 $\beta$ is highly significantly different from zero, implying that the ensemble mean still contains
useful predictive ability at that lead. If we allow ourselves to extrapolate the curve to longer leads by
eye, it would seem likely that the ensemble mean would still contain useful predictive information even
beyond that.

The fact that our values of $\delta$ are significantly different from zero out to the end of the forecast (just) shows
that there is significant information in the ensemble spread too. However, in this case if we extrapolate
to higher lead times by eye it seems unlikely that there would be any more skill in $\delta$. Since $\delta$ is below
one and $\gamma$ is non-zero we see that the standard deviation of the ensemble is not an optimal estimate of
the uncertainty of the prediction.

The $\gamma + \delta s$ transformation can change both the mean spread (the time mean of the standard deviation
across the ensemble) and the variability of that spread (the standard deviation in time of the standard
deviation across the ensemble). To measure the effect on the mean spread, figure 2 shows values of
$(\gamma + \delta \bar{s})/\bar{s}$ (where the overbar indicates the mean in time over the year of data), which is the factor by
which the transformation increases the mean spread. We see that at short lead times, the ensemble spread $s$
is far too small on average and the calibration increases the spread by factors of around 4 (at lead 0)
and 2 (at lead 1). At longer lead times the ensemble spread is still too small on average by a factor of
around 1.2. This underestimation of the spread from ensemble forecasts has been noted by a number of
authors such as Ziehmann (2000) and Mullen and Buizza (2000). It is likely to be due to model error in
the prediction model and due to the fact that the forecast is a prediction of a large scale flow while the
observation is site-specific and hence affected by small-scale variability not represented in the model.

The size of the effect of the calibration on the variability (in time) of the spread is given by the value
of $\delta$. Since $\delta$ is significantly different from one at all lead times beyond the first we conclude that the
variability of the spread from the ensemble needs to be reduced to be optimum at those lead times. This
could be because the variability of the ensemble spread is too large, or because the variability of the
ensemble spread is not highly correlated with the real variability of skill.

We can see from the values of $\delta$ that the variability of the ensemble spread alone will overestimate the
state-dependent predictability of this model by a large factor at long leads. A better estimate for the
level of state-dependent predictability is given by the variability of the calibrated spread, which is smaller
by the $\delta$ factor.

Figure 3 shows the ratio of the standard deviation of the ensemble spread to the mean ensemble spread
at different lead times, estimated from both the uncalibrated and the calibrated ensemble data. We call
this ratio the coefficient of variation of the spread (COVS). These values
give an indication of how much extra information we get about the forecast uncertainty by using the
(uncalibrated or calibrated) spread of the ensemble rather than using a level of uncertainty which is
constant with time. The uncalibrated data suggests that variations in uncertainty that are 20% to 55%
of the mean uncertainty are predictable using the ensemble spread. However, because the uncalibrated
data both underestimates the total spread (the denominator in the COVS) and overestimates the predictable
part of the variability of the spread (the numerator in the COVS) these values seem to be overestimates.
The calibrated data suggests that variations in the uncertainty that are only 5% to 20% of the mean
uncertainty are predictable using the ensemble spread.
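The two diagnostics used in this section, the mean spread ratio of figure 2 and the COVS of figure 3, are simple to compute once $\gamma$ and $\delta$ are known. The sketch below uses invented spread values and invented parameter values purely to show the arithmetic; the function names are hypothetical.

```python
import numpy as np

def covs(spread):
    """Coefficient of variation of spread: standard deviation in time
    of the spread divided by the mean in time of the spread."""
    spread = np.asarray(spread, dtype=float)
    return spread.std() / spread.mean()

def calibrated_covs(spread, gamma, delta):
    """COVS of the calibrated spread gamma + delta * s_i."""
    return covs(gamma + delta * np.asarray(spread, dtype=float))

def mean_spread_ratio(spread, gamma, delta):
    """Factor by which the calibration changes the time-mean spread."""
    spread = np.asarray(spread, dtype=float)
    return (gamma + delta * spread.mean()) / spread.mean()

# Invented daily ensemble spreads and invented calibration parameters.
spread = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
gamma, delta = 0.4, 0.5
raw_covs = covs(spread)
cal_covs = calibrated_covs(spread, gamma, delta)
ratio = mean_spread_ratio(spread, gamma, delta)
```

With a positive $\gamma$ and a $\delta$ below one, the calibration raises the mean spread while shrinking its variability in time, so the calibrated COVS is smaller than the uncalibrated one, as in figure 3.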

5 Other distributions 

In cases where the forecast errors are not close to normally distributed, one can use other distributions. 
For example, in the case where the forecast errors show skew, the skew-normal distribution SN can be
used. The skew-normal distribution is a generalisation of the normal distribution which has a third
parameter, and includes the possibility of modelling skew (Azzalini, 1985). Suppressing the index $i$ for
clarity we then have:

$$T \sim SN(\alpha + \beta m, \gamma + \delta s, \zeta + \eta k) \qquad (3)$$

where we have introduced the ensemble skew $k$ and two new parameters $\zeta$ and $\eta$.

The skew-normal model can be fitted using maximum likelihood methods exactly as for the normal
distribution. One of the results from such a fitting process would be a clear indication as to whether the
forecast being calibrated does or does not contain statistically significant information about the skew of
observed temperatures (this question has been discussed by Denholm-Price and Mylne (2002)).
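A log-likelihood for equation 3 might be sketched as below. The text does not state which parametrisation of the skew-normal is intended, so this sketch assumes Azzalini's location, scale and shape form, in which the first argument is a location rather than the mean whenever the shape is non-zero. The function names are hypothetical.

```python
import math
import numpy as np

def skew_normal_logpdf(x, location, scale, shape):
    """Log-density of the skew-normal distribution (assumed parametrisation):

    f(x) = (2 / scale) * phi(z) * Phi(shape * z),  z = (x - location) / scale
    where phi and Phi are the standard normal density and distribution function.
    """
    z = (np.asarray(x, dtype=float) - location) / scale
    log_phi = -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2
    # Phi(t) = 0.5 * erfc(-t / sqrt(2)); erfc keeps precision in the lower tail.
    erfc = np.vectorize(math.erfc)
    Phi = 0.5 * erfc(-shape * z / math.sqrt(2.0))
    return np.log(2.0) - np.log(scale) + log_phi + np.log(Phi)

def log_likelihood_3(params, m, s, k, T):
    """Log-likelihood of equation 3 given ensemble mean m, spread s and skew k."""
    alpha, beta, gamma, delta, zeta, eta = params
    return np.sum(skew_normal_logpdf(T, alpha + beta * m,
                                     gamma + delta * s,
                                     zeta + eta * k))

# With zero shape the density reduces to the normal density of equation 2.
value_at_zero_shape = skew_normal_logpdf(0.3, 0.1, 0.5, 0.0)
```

This function could be maximised over the six parameters in the same way as the likelihood of equation 2, and the significance of $\eta$ then indicates whether the ensemble skew carries information.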

For extremely non-normal distributions for which even the skew-normal is not flexible enough,
non-parametric distributions may be more appropriate. A simple non-parametric method would be to use
a kernel density, with a single free parameter for the width of each kernel (see Bowman and Azzalini (1997)
for a description of kernel densities). Such a method would look a little like the method of Roulston and Smith
(2003) even though it is justified in a completely different way.
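The kernel density idea can be sketched in a few lines. The version below assumes Gaussian kernels centred on the ensemble members with one common width, and chooses that width from a list of candidates by maximum likelihood; the function names, the candidate widths and the synthetic data are all hypothetical.

```python
import numpy as np

def kernel_log_density(obs, members, width):
    """Log of a Gaussian kernel density built from the ensemble members,
    evaluated at the observation for each day.

    obs     : shape (n_days,)
    members : shape (n_days, n_members)
    width   : common kernel standard deviation (the single free parameter)
    """
    obs = np.asarray(obs, dtype=float)
    members = np.asarray(members, dtype=float)
    z = (obs[:, None] - members) / width
    kernels = np.exp(-0.5 * z ** 2) / (width * np.sqrt(2 * np.pi))
    return np.log(kernels.mean(axis=1))

def fit_width(obs, members, candidates):
    """Return the candidate width with the highest total log-likelihood."""
    scores = [kernel_log_density(obs, members, w).sum() for w in candidates]
    return candidates[int(np.argmax(scores))]

# Synthetic example: 200 days, 51 members.
rng = np.random.default_rng(5)
members = rng.standard_normal((200, 51))
obs = rng.standard_normal(200)
candidates = [0.1, 0.2, 0.4, 0.8]
width = fit_width(obs, members, candidates)
```

A fuller treatment would also calibrate the location of the members before dressing them with kernels; that step is omitted here to keep the single free parameter in view.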



6 Conclusions 

We have described a simple parametric method for the assessment and calibration of ensemble
temperature forecasts. The method consists of applying linear transformations to the mean and standard
deviation of the ensemble. The parameters of the model can be fitted easily using the maximum
likelihood method. The model has various advantages and disadvantages relative to other calibration
models currently in use. The main disadvantage is that the model only works for forecast errors that are
reasonably close to normally distributed, although extensions have been described that should overcome
that limitation. The advantages of the model are that:

• the calibration of forecasts using the model is extremely simple 

• the model is transparent and easy to understand 

• the model separates skill in predictions of the mean and the spread 

• calculating confidence intervals on parameters is easy 

• the model gives a clear indication of how many days of useful skill there are in a forecast 

We have applied the model to one year of site-specific ECMWF ensemble forecasts. We find that the 
forecasts have highly significant skill for predicting both the mean and the standard deviation out to 10 
days. The forecasts underestimate the mean uncertainty, as has been reported in other studies. They 
also over-estimate the variability of the uncertainty. For these forecasts we estimate that the predictable 
part of the uncertainty is only between 5% and 20% of the mean uncertainty, depending on lead time. 
For some applications this variability in the uncertainty may be small enough that it can be ignored and 
one could make the simplifying assumption that the uncertainty is constant in time.

Further work includes:

• Developing algorithms that avoid having to make the assumption that the forecast error is
uncorrelated in time.

• Out of sample testing of the calibrated forecasts, using both measures from within the framework 
(i.e. likelihood) and also other measures such as rank histograms, reliability diagrams and ROCs. 

Figure 1: The optimum values for the parameters in equation 2 (solid line), 95% confidence intervals
(dotted line) and the constant values 0 and 1 (dashed line)




Figure 2: The ratio of the time mean of the standard deviation of the calibrated ensemble to that of the 
uncalibrated ensemble. 




Figure 3: Both lines show the ratio of the standard deviation in time of the standard deviation across 
the ensemble to the mean in time of the standard deviation of the ensemble. This ratio is given the name 
coefficient of variation of spread (COVS) in the text. The solid line was estimated using the uncalibrated 
ensemble, and the dotted line using the calibrated ensemble. 



